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Abstract 

Background: Protein-protein interaction networks and phenotype similarity information have been synthesized 
together to discover novel disease-causing genes. Genetic or phenotypic similarities are manifested as certain 
modularity properties in a phenotype-gene heterogeneous network consisting of the phenotype-phenotype 
similarity network, protein-protein interaction network and gene-disease association network. However, the 
quantitative analysis of modularity in the heterogeneous network and its influence on disease-gene discovery are 
still unaddressed. Furthermore, the genetic correspondence of the disease subtypes can be identified by marking 
the genes and phenotypes in the phenotype-gene network. We present a novel network inference method to 
measure the network modularity, and in particular to suggest the subtypes of diseases based on the 
heterogeneous network. 

Results: Based on a measure which is introduced to evaluate the closeness between two nodes in the 
phenotype-gene heterogeneous network, we developed a Hitting-Time-based method, CIPHER-HIT, for assessing 
the modularity of disease gene predictions and credibly prioritizing disease-causing genes, and then identifying the 
genetic modules corresponding to potential subtypes of the queried phenotype. The CIPHER-HIT is free to rely on 
any preset parameters. We found that when taking into account the modularity levels, the CIPHER-HIT method can 
significantly improve the performance of disease gene predictions, which demonstrates modularity is one of the 
key features for credible inference of disease genes on the phenotype-gene heterogeneous network. By applying 
the CIPHER-HIT to the subtype analysis of Breast cancer, we found that the prioritized genes can be divided into 
two sub-modules, one contains the members of the Fanconi anemia gene family, and the other contains a 
reported protein complex MRE1 1/RAD50/NBN. 

Conclusions: The phenotype-gene heterogeneous network contains abundant information for not only disease 
genes discovery but also disease subtypes detection. The CIPHER-HIT method presented here is effective for 
network inference, particularly on credible prediction of disease genes and the subtype analysis of diseases, for 
example Breast cancer. This method provides a promising way to analyze heterogeneous biological networks, both 
globally and locally. 



Background 

Disease gene prediction is one of the most important 
aims in biological and medical sciences. Network-based 
evidence as well as inference approaches has become 
more and more attractive in the research field of 
disease-causing gene discovery, and a variety of methods 
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have been developed recently from this point of view 
[1-5]. Researchers also attach great importance to spe- 
cial features embedded in biological networks especially 
the protein-protein interaction (PPI) network for deeply 
understanding molecular mechanism of common 
human diseases [6-15]. Since genetic diseases are geneti- 
cally or phenotypically similar, it is promising to com- 
bine the protein-protein interactions and the phenotype 
similarities to a phenotype-gene heterogeneous network 
to infer the candidate disease genes [1-4]. The so-called 



Yao et al. BMC Systems Biology 201 1 , 5:79 
http://www.biomedcentral.eom/1752-0509/5/79 



Page 2 of 1 1 



"phenotype-gene heterogeneous network" reflects a hol- 
istic view of complex relationships among various phe- 
notypes and phenotypes, phenotypes and genes, as well 
as genes and genes, which consists of the phenotype- 
phenotype similarity network, gene-disease association 
network and protein-protein interaction network, 
respectively. Based on such a heterogeneous network, 
we propose a regression model named CIPHER (Corre- 
lating protein Interaction network and PHEnotype net- 
work to pRedict disease genes) to quantify the 
concordance between candidate genes and target pheno- 
types [2], The algorithm of random walk is also pro- 
posed to prioritize the candidate disease genes in 
protein-protein interaction networks [3] and then a ran- 
dom walk with restarts (RWR) method is extended to 
the above heterogeneous network [4]. 

In general, the network-based disease-gene discovery 
methods make use of information from both the topolo- 
gical structure and the associations between diseases and 
genes. The basic assumption is that similar disease phe- 
notypes are caused by functionally related genes and 
these genes are likely to be close to each other on the 
protein-protein interaction networks, so that network 
modules are formed [15-18,5]. Here the network module 
in computation refers to a group of genes exhibiting net- 
work proximity, and in biology refers to certain func- 
tional units such as protein complexes, signaling or 
metabolic pathways and transcriptional programs 
[16-19,5]. Therefore, the algorithms in [3] prioritize can- 
didate genes based on their closeness to known disease 
genes. After the similarity information between the phe- 
notypes is provided by van Driel et al. through text 
mining technology [17], the phenotype similarity and the 
protein-protein interactions are combined together for 
the prioritization of the candidate disease genes [1,2,4]. 

However, so far little modularity analysis on the phe- 
notype-gene heterogeneous networks has been done. 
The predicted results from the network inference meth- 
ods need to be tested to see whether they form the 
modules and to which corresponding biological function 
they are related. In this paper, the network inference 
methods are further developed to measure the modular- 
ity property of the disease-gene prediction results. 
Furthermore, we also provide the method to infer the 
relationship between the subtypes of diseases and the 
modules formed by these predicted results. 

Inference on the phenotype-gene heterogeneous network 

For the network-based inference, a candidate gene g is 
prioritized to be a potential disease-causing gene of the 
target phenotype p if one or both of the followings are 
satisfied: 

1. The gene g is close to some disease-causing genes 
associated with p. 



2. The gene g is close to some phenotypes which are 
highly similar to p. 

Hence one key point is to define the closeness 
between two nodes in the network, and this will be used 
to measure the similarity between the nodes based on 
the network topology [1,2]. Currently the nearest-neigh- 
bor method considers the direct interactions informa- 
tion and ignores the long-range interactions. The 
shortest-path method considers the length of the short- 
est path connecting two nodes but ignores the number 
of short paths between them. The random walk with 
restart method [3,4] combines the local and global net- 
work information to enhance the prediction 
performance. 

Another key point is the priori information known 
about each target phenotype, the known disease-genes 
and the similar phenotypes. In the phenotype-gene het- 
erogeneous network, for each given phenotype p, its 
known causing genes and the similar phenotypes are 
represented as the nodes which link to p directly, and 
these nodes are termed as the adjacent nodes of the tar- 
get phenotype p in the heterogeneous network. The 
paths between p and any other nodes have to cross this 
adjacency set. Therefore, the prioritization can be car- 
ried out by measuring the closeness between the candi- 
date genes (namely all genes in the protein-protein 
interaction network) and these adjacent nodes. 

In this paper, we introduced a closeness measure 
based on the methods of Mean-Hitting-Time and condi- 
tional Mean-Hitting-Time, which not only capture the 
global relationships within the phenotype-gene heteroge- 
neous network, but also free to rely on any priori para- 
meters. Moreover, by studying the different relationships 
to different adjacent nodes, we assume that the priori- 
tized genes can be further divided into sub-modules 
which may correspond to the subtypes of the disease. 
And the conditional Mean-Hitting-Time can be applied 
to discover such disease subtypes. The present Hitting- 
Time-based method with the flowchart illustrated in 
Figure 1 is called CIPHER-HIT, as a continuation of our 
CIPHER method [2]. 

Candidate disease genes prioritization: which are the 
most credible? 

Based on the closeness measure of the phenotype-gene 
heterogeneous network, the candidate genes can be 
prioritized according to their topological similarities of 
the adjacent nodes. The inference is of the same spirit 
as the methods in [1-4]. However, some disease-causing 
genes are likely to be topologically similar, whereas 
some others will be dispersed among the heterogeneous 
network. As shown in Figure 1, for a phenotype that has 
many known disease genes and similar phenotypes, we 
probe the relationships among these adjacent nodes and 
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Figure 1 The flowchart of the CIPHER-HIT method. In the CIPHER-HIT method, we first evaluate the modularity of the adjacent nodes around 
the given phenotype, and then select credible reference for disease gene prediction by the Mean-Hitting-Time, which will be further subjected 
to detection of disease subtypes by the conditional Mean-Hitting-Time on the phenotype-gene heterogeneous network. 



suppose that if an adjacent node (a known disease gene 
or a similar phenotype to the target phenotype) has 
higher topological similarity with the others, then it will 
be a more credible reference gene or phenotype for 
inference of disease-causing genes. Here the topological 
similarity between two nodes means their closeness or 
connectivity strength on the network, which can be 
defined as the Mean-Hitting-Time of the random walk. 
We consider this hypothesis is reasonable since it is 
widely assumed that similar phenotypes may be caused 
by functionally close related genes [15,16], thus if more 
information about protein-protein interactions, gene- 
phenotype associations as well as phenotype-phenotype 
similarities is known, higher inference accuracy in gene- 



phenotype relationship inference will be achieved. As a 
graphic illustration shown in Figure 2, nodes U\, w 2 and 
u 3 will be the more credible references than m 4 since 
they are close to each other. Therefore, our CIPHER- 
HIT method developed is firstly used to measure the 
connectivity strength between one adjacent node and 
the others, and then those candidate genes near the 
credible reference will be marked as the ones being 
more likely to form modules in the network. 

Gene sets inference for the disease subtypes 

Identifying subtypes of diseases such as cancer is of cri- 
tical importance for predicting clinical outcomes as well 
as designing more-specific therapies for patients, 
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Figure 2 Illustration of the network inference and modularity 
measure in CIPHER-HIT. The circle nodes represent the genes and 
the rectangle nodes represent the phenotypes. The red node 
denotes the target phenotype p. The yellow nodes (u) denote the 
adjacent nodes of p, i.e. the set £p, referring to either genes or 
phenotypes. (A) The dashed ellipses enclose the adjacent nodes 
which share high topological similarity. The nodes u 1( u 2 and u 3 are 
close to each other. Therefore candidate nodes g,, g 2 , and g 4 , 
which can more easily form a module in the protein-protein 
network, will be prioritized as the potential disease-genes. The 
group u-i, u 2 u 3 , g,, g 2 , g 3 and g 4 will be inferred as a module 
related to phenotype p. (B) The illustration of the meaning of the 
conditional Mean-Hitting-Time Eg(lp|Tp < T£ \{ Uj }). Among the 
paths from the candidate genes g to the phenotype p, the 
influence of the paths passing the adjacent nodes other than u q are 
excluded, which are illustrated as dashed lines. 



facilitating a new era of translational medicine and per- 
sonalized medicine [20,21]. The intrinsic cancer sub- 
types have been studied in different ways by using 
histology, molecular pathology, genetic mutation and 
gene-expression information [21]. The classification of 
human cancer has become more and more informative 
as the detailed molecular analysis is provided. For exam- 
ple, the molecular heterogeneity in tumor can be recog- 
nized according to the different patterns of the gene 
expression information [20-22]. Interestingly, Li et al. 
recently reported an integrative network analysis 
method to identify recurrent network modules that con- 
tribute to Breast cancer metastasis by using a set of 
tumour gene microarrays [23]. Since molecular network 
modules have been detected in cancer subtypes [23], it 
is possible to use network modules to further classify 
Breast cancer into subtypes. 

It is well accepted that similar phenotypes may be 
caused by functionally close related genes [1-16]. An 
extension of this assumption would be that genes 
related to different subtypes are likely to form distinct 
protein-protein interaction modules, which is a common 
indicator of gene functional relationship [24]. 

Thus, our CIPHER-HIT method is further used to 
identify the sub-groups of genes corresponding to the 
cancer subtypes. Such groups of genes are called sub- 
modules in the network, and the main task of our 
method is to identify the gene sets related to different 
subtypes of a target disease (or phenotype). In cases 
where the heterogeneity information of a phenotype is 



included in its adjacent nodes, it is promising to further 
classify the prioritized genes based on such information. 
The similar phenotypes and their associated genes have 
also provided information for identifying the sub-mod- 
ules. For example, the phenotype node representing 
FANCONI ANIMIA has high topological similarity to 
the phenotype node BREAST CANCER. Recent studies 
demonstrate that genes FANCA, FANCB, FANCC, 
FANCD2, FANCE, FANCF and FANCG associated with 
Fanconi animia are closely related to the susceptibility 
of Breast cancer [25,26]. These genes can be prioritized 
to be associated with Breast cancer by CIPHER-HIT 
successfully. In addition, by discriminating the adjacent 
nodes through which these genes are prioritized, they 
can also be marked as the sub-module corresponding to 
the subtype of Fanconi animia related Breast cancer. 

Thus, in this work, we develop a method to reveal the 
relationship between each prioritized gene and each adja- 
cent node so that the hierarchical clustering method is 
applied to discover the potential subtypes of the target 
phenotype. These results are meaningful for further bio- 
medical and experimental researches, since they help to 
focus on the genes which are likely to form the sub-mod- 
ules corresponding to the potential subtypes of diseases. 

Results and Discussion 

CIPHER-HIT: the topological closeness measure based on 
the Mean-Hitting-Time 

The CIPHER method [2] and the random walk with 
restart method (RWR) [3,4] are the approaches which 
reflect the global structural information of the pheno- 
type-gene heterogeneous network, while the parameters 
such as the restart rate in RWR, which are related to 
the performance, are required to be pre-set. In the 
CIPHER-HIT method, we present a new closeness mea- 
sure between two nodes based on the Mean-Hitting- 
Time of the random walk on the heterogeneous net- 
work. Although this measure is developed from the 
same mathematical background as the random walk 
with restart method [3,4], it both reflects the global 
topological information very well and refrains from set- 
ting up a difficult-to-explain priori parameter. Moreover, 
one extension of this measure - the conditional Mean- 
Hitting-Time can be used to discover modularity char- 
acteristics on the phenotype-gene heterogeneous net- 
work and contribute to disease subtype inference. 

For a random walk on the network, the Hitting-Time to 
the set of nodes B, denoted by t B , is defined as the first time 
when B is visited. The Mean-Hitting-Time of a random 
walk starting from the node a to the set B is defined as 



E a r B = feP a (r B = k) 



(1) 



where V a (v B = /c) refers to the possibility that a ran- 
dom walk starting form node a hits the set B at a time 
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point k, and k is the summing target ranging from 1 to 
positive infinite. 

The Mean-Hitting-Time include all the router infor- 
mation between the node a and set B. We define the 
closeness measure between node a and set B by the 
scaled Mean-Hitting-Time (MHT) with the maximal 
value for all nodes a on the network. 



MHT(a, B) = 



E fl (r B ) 



max a >E a >{T B ) 



(2) 



Here E a (r B ) can be inconveniently large in actual cal- 
culation, so we scale it to ensure the range of MHT is 
between 0 and 1. 

Furthermore, if we need a topological closeness 
between the node a and the set B without the influence 
of a given set of nodes, A, the conditional Mean-Hit- 
ting-Time will be a natural choice. It is defined as 



E a (r B |r B < r A ) = Ej£ 0 feP fl (r B = k\ t b < x A ) 



(3) 



where P fl (r B = k\xs <Ta) refers to the possibility that a 
random walk starting form node a hits the set B at a 
time point k, conditioning on the same random walk 
hits the set B before it hits the set A. 

Similarly, we define the scaled conditional Mean-Hit- 
ting-Time (CMHT) CMHT(«, B\A), as the closeness 
measure between node a and set B, without the influ- 
ence of set A, 



CMHT(a,B|A) 



E fl (T B |T B < T A ) 
mflX a ^AEa/(T B |T B < T A ) 



(4) 



We also scale CMHT to the range between 0 and 1 to 
avoid the inconvenient large E a (r B |r B < r A ) in actual 
calculation. Both of the closeness measures defined in 
Equation (2) and Equation (4) can be computed expli- 
citly without any preset parameters (see detailed compu- 
tational methods in Material and Methods). 

Performance of CIPHER-HIT in credibly predicting 
diseases-causing genes 

In this work, we firstly apply the scaled Mean-Hitting- 
Time in ranking candidate disease-causing genes based 
on the phenotype-gene heterogeneous network. The 
adjacency set of a certain node n on the network is 
defined as all those nodes linked to n by an edge on the 
network, either a 1-valued association as in the protein- 
protein interaction network and gene-disease association 
network, or a positively weighed association as in the 
phenotype-phenotype similarity network filtered by a 
threshold (see Material and Methods). For each given 
phenotype p having an adjacency set £ p = {u\, ■ ■ ■ , u m }, 
we compute MHT(g,{p}) for each candidate gene g. After 
ranking MHT(g,{p}) from the smallest to the largest, a 
gene g will be prioritized as the potentially causal gene 



associated with phenotype p if MYVY(g,{p}) <6 R> where 0 R 
is the filtering threshold. The detailed setting of 6 R will 
be discussed at middle of the second to last paragraph of 
this subsection. The ranking information of each gene g 
is recorded as the ranking position RANK^,^). For the 
target phenotypes p which have many nodes in the adja- 
cency set £ p , we introduce the Modularity Level through 
conditional Mean-Hitting-Time as below: 

M p {ui) = min„ £ep \ ( „ i} CMHT(tt, {u;}|{p}), i = 1, ■ ■ ■ , m, (5) 

which can be used to test the connectivity strength 
between m, and other adjacency nodes. Note that a smal- 
ler value of the conditional Mean-Hitting-Time (M p (iii)) 
indicates a higher modularity level, namely a stronger 
connection between the adjacent node («,•) and other 
nodes in the adjacency set {£p\{Ui}). By calculating the 
minimum conditional Mean-Hitting-Time, we assess the 
modularity level of one node p on the network with 
regard to its adjacency node m, as the maximum connec- 
tivity strength between other adjacency nodes u and h,-. 
Different from the concept of topological similarity 
between two nodes, the modularity level of one node 
with regard to another takes the other adjacency nodes 
into consideration, and serves as the measure of connec- 
tivity strength among more than two connected nodes. 
Then we set a threshold 9 M to distinguish the adjacent 
nodes so that u, € £ p which satisfies M p (ui) <9 M will be 
marked as the one with high connectivity strength to 
the other adjacent nodes. 

Hence the adjacent nodes are divided into two parts, 
£ p and £ p which are defined as 



£' ={u<e£ p : M p (u) < 6 M ), 



£" ={ue£ p : M p {u) > 6 M }. 



(6) 



(7) 



According to the definition above, £ p denotes the 
adjacent nodes u including disease-genes associated with 
p or phenotypes similar to p that are strongly connected 
with each other. For any m,, Uj e £' the random walk 
starting from m, will reach Uj easily without passing p. 
This feature is illustrated in Figure 2B. 

Next, we analyze the prioritized genes for target phe- 
notype p. We measure the closeness between each gene 
to the nodes in £' p without the influence of the nodes in 
£" p . We compute CMHT(g, £' p \£" p ) for each gene g and 
then rank results from the smallest to the largest, so 
that we record the ranking position ^(g). By compari- 
son of RANK p (g) and RANK' p (g) for each prioritized 
genes, and if RANK p (g)/RANK' p (g) > 1, we conclude 
that gene g is in association with the node p because it 
is close to the adjacent nodes in set £' p , and these genes 
are marked as the most credible predicted results. 
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The performance of CIPHER-HIT is evaluated by a 
genome-wide leave-one-out cross-validation. The candi- 
date gene set is defined as all genes on the heteroge- 
neous network. The set of validated genes are the 
known associated genes of the disease phenotypes. At 
each round of the validation, one gene associated with 
the target phenotype will be chosen as a validated sam- 
ple, the link between the chosen gene-node and the 
phenotype-node is removed and the scaled Mean-Hit- 
ting-Time from each gene-node to the target pheno- 
type-node (the one from which a link is removed) is re- 
computed and ranked from the smallest to the largest. 
Note that a disease gene can be associated with many 
phenotypes. Therefore, the gene is deemed to come 
from different samples when the validation is carried 
out for different phenotypes. If a sample for validation 
satisfies MHT(g,{p}) <8 R , it will be considered a success- 
ful prediction. The results of the leave-one-out cross 
validation are shown as the receiver operating character- 
istic (ROC) curves in Figure 3, where the horizontal 
coordinates (1 -Specificity) refer to values of 6 R , and the 
vertical coordinates (Sensitivity) refer to the true-posi- 
tive rate corresponding to 6 R . The validation on the dis- 
ease genes in the set £" p produces obviously poorer 
performance than the validation on the disease genes in 
the set £ p. This is reasonable since the genes in £' p are 
likely to be close to the other known disease genes or 
phenotypes similar to p. From the results shown in Fig- 
ure 3A, we found that the higher the modularity level a 
gene to the other adjacent nodes is, the higher the suc- 
cessful rate of the validation will be. When compared 
with the random walk with restarts (RWR) method [4], 
we found that the ROC curves of both RWR and 
CIPHER-HIT are comparable. However, when taking 
into account the modularity levels, only the adjacent 
node u of M p {u) <8 M = 0.3 are used for inference in 
CIPHER-HIT method, the so-called modular CIPHER- 
HIT can significantly improve the performance of dis- 
ease gene predictions, making it possible to reach the 
credible prediction of disease genes (Figure 3B). 

Note that though we mark the prioritized genes that 
are close to the adjacent node in £ p , we do not exclude 
the other prioritized genes. The nodes in £" p are also 
available to form modules with other genes but they 
might not be exhibited because of the incompleteness of 
the network information. Since the genes in £' p already 
exhibit the inclination to have tight relationship, we sug- 
gest the marked genes be selected for further biological 
investigation with high priority. 

Disease subtype inference by CIPHER-HIT 

The development of a reliable method to identify disease 
subtypes will not only enhance our understanding of 
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Figure 3 Results of the genome-wide cross-validation for 
disease gene prioritization. (A) The conditional Mean-Hitting-Time 
(Mp(g)) is calculated by Equation (5). Results showed that genes with 
high modularity levels to the other adjacent nodes with small M p (g) 
values will be more likely to be successfully prioritized during the 
validation. (B) The receiver operating characteristic (ROC) curves of 
the genome-wide leave-one-out cross-validation. The horizontal 
coordinates (1 -Specificity) refer to values of 0 R , while the vertical 
coordinates (Sensitivity) refer to the true-positive rate corresponding 
to Or. The red solid line denotes the inference of modular CIPHER- 
HIT based on the nodes in £ p i.e. only the adjacent node u of M p 
(u) <9 M = 0.3 are used for inference. The dashed lines both denote 
the inference based on the nodes in £ p i.e. only the adjacent node 
u of M p (u) > 0 M = 0.3 are used for inference, where the blue 
dashed line denotes results from the random walk with restarts 
(RWR) method, and the red dashed line denotes results from the 
CIPHER-HIT method. 



disease mechanism, but also provide principles for design- 
ing a tailored diagnosis and treatment for patients. For a 
long time, identification of disease subtypes by phenotype 
associations of patients is of highly importance for assign- 
ing individual treatments in the medical community, espe- 
cially in traditional Chinese medicine which holds "Bian- 
ZHENG-Lun-Zhi" (Syndrome differentiation and treat- 
ment for disease) as its core concept [27] . Inspired by such 
a rationale [27], we further note that in the heterogeneous 
networks, the adjacent set of a target phenotype can be 
used not only to predict potential disease-causing genes, 
but also to reveal further structural relationships among 
the genes with regard to their contributions to disease 
phenotypes. If the prioritized genes of a query phenotype 
can be further grouped into several classes according to 
different functions, then the sub-modules in the network 
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are expected to be distinguished to correspond to these 
sub-groups of genes. 

Thus, in the framework of CIPHER-HIT, given a 
queried phenotype p, suppose its adjacent node and 
prioritized gene set are {u lt —, u m } and {g lt — , g^}, 
respectively, then we define 



Cpfe Ui) = CMHT(g, {Ui}\£ p \{ui}), i 



(8) 



which measures the closeness between the gene g and 
the adjacent node m, without the influence of the other 
adjacent nodes. Note that the selection of prioritized 
genes set {g lt -, g/} here is addressed by fitting a thresh- 
old 0 R in the step of disease gene prioritization. Since 
we filter credible disease gene set by the Mean-Hitting- 
Time MHT(g, {/?}), we naturally choose the threshold as 
the critical point of the empirical distribution function 
of MHT(g, {p}) for all genes on the network (See case 
study for Breast cancer). Then, as shown in Figure 2B, 
the value c p (g, U\) will only depend on the path connect- 
ing gene g and p trough the adjacent node U\, without 
considering the paths passing other adjacent nodes u 2 , 
u-i, — . After computing c p {g, w ; ) for all the adjacent 
nodes of p, we can get feature vectors of the prioritized 
genes g. By the alignment of such feature vectors of all 
the prioritized genes, we obtain the following matrix 



C : 



Cpfe'Wl) • • • Cp{g k ,U m ) 



(9) 



Next, the classification of the prioritized genes can be 
done by diagonalization of the matrix C in Equation (9) 
by using the hierarchical clustering method. Further- 
more, after matrix diagonalization, suppose the genes 
are divided into groups Gi, G 2 —,G/, and the adjacent 
nodes are divided into <? P/ i,£ Pi 2, • • • ,£ p ,k, then it is pro- 
mising to analyze the subtypes of the phenotype p based 
on such divisions. And the resulted sub-groups of dis- 
ease genes are likely to be related to the functional units 
of disease subtypes. 

Finally, we statistically analyze the subgroups of genes 
to evaluate whether they are separable in terms of net- 
work topology. We calculate the Mean-Hitting-Time 
between pairs of predicted disease-causing genes, either 
within the same subgroup or between different sub- 
groups, to assess the topological similarity. The Fisher's 
exact test [28] is employed to access whether gene pairs 
within the same subgroup are more topologically similar 
than gene pairs in separate subgroups. 

A case study on Breast cancer subtype detection 

Breast cancer is known to be a carcinoma with highly 
heterogeneous [21] and its heterogeneity is more com- 
plicate than the results suggested by histopathological 



analysis alone [29], so it became necessary to find more 
molecular evidence to distinguish Breast cancer sub- 
types. Therefore, we take "Breast cancer" as a typical 
case to evaluate the performance of CIPHER-HIT for 
detection of disease subtypes. 

As shown in Figure 4A, the credible disease genes for 
Breast cancer predicted by CIPHER-HIT were filtered 
by the critical point of threshold 6 R = 0.96 and resulted 
in a total of 155 credibly prioritized genes. Interestingly, 
by classification of the adjacent vectors described above, 
we found that it is worthwhile to note that 53 of the 
prioritized genes of Breast cancer can be divided into 
two groups (Figure 4B and 4C). The group containing 
the members of the Fanconi anemia gene family are 
tightly connected to the phenotypes FANCONI ANE- 
MIA (OMIM ID: 227650), ATAXIA TELANGIECTA- 
SIA (OMIM ID: 208900), BREAST CANCER 1 GENE 
(OMIM ID: 113705), XERODERMA PIGMENTOSUM 
(OMIM ID: 278700) and the disease gene BRCA2. 
Another group is tightly related to the disease genes 
BRIP1, BRCA1, NBN and RAD51. BRCA1 is shared by 
both groups. In addition, the adjacent nodes of Breast 
cancer are divided into two parts, each of which leads to 
a sub-group of genes representing a subtype of Breast 
cancer. The two subtypes with genes obtained by the 
predictions of CIPHER-HIT not only have significant 
difference in topological features by Fisher's exact test 
(P < 0.0001 for both subtypes, see Table 1), but also 
yield agreements with the evidence reported by recent 
studies [25,26,30-34]. For example, the genes RAD50 
and MRE11A in one of the predicted sub-groups are 
reported to form a protein complex related to Breast 
cancer [30]. Moreover, genes in the other predicted sub- 
group consist of FANCA, FANCB, FANCC, FANCD2, 
FANCE, FANCF and FANCG, which belong to the Fan- 
coni anemia gene family, have been shown to be risk 
breast cancer susceptibility genes and contribute signifi- 
cantly to breast cancer predisposition [25,26]. The 
importance of genes involving in this subtype of Breast 
cancer is also supported by recent studies. For example, 
the polymorphisms of CYP19A1 (the aromatase gene) 
are closely related to the status and expression levels of 
estrogen receptor (ER) [31-33], HER2/neu [34] as well 
as progesterone [35]. Therefore, we suggest that the 
subtypes predicted by our method may serve as impor- 
tant genetic determinants that can influence the devel- 
opment of the well-known subtypes of breast cancer 
such as ER positive/negative, HER2 positive/negative, or 
progesterone receptor positive/negative [36,37]. 

Thus, the case study of Breast cancer shown in Figure 4 
provides evidence that the connectivity features of the 
phenotype-gene heterogeneous network can be used to 
distinguish the molecular bases related to different dis- 
ease subtypes and lead to novel findings. And the 
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Figure 4 Two subtypes of Breast cancer detected by CIPHER-HIT. (A) The empirical distribution function of MHT(g,{p}) where p denotes the 
BREAST CANCER and g denotes all genes on the network. The 0 R threshold = 0.96 at the critical point is selected in the Breast Cancer case. (B) 
The rows represent the similar phenotypes and disease-genes associated with Breast cancer and the columns represent the prioritized genes. 
The grey color indicates the closeness between an adjacent node and a prioritized node measured by the conditional Mean-Hitting-Time. 
Therefore the prioritized nodes are divided into two clusters in which the gene names of the nodes are displayed by red and blue respectively. 
(C) The yellow squares are the phenotypes with high similarity to Breast cancer and the yellow circles are the disease-genes associated with 
Breast cancer. For a better illustration, we left out two phenotypes (PI 20435 and PI 76807) in (B) with no connections to other nodes in the 
selected network. The blue and red circles denote two groups of prioritized genes by CIPHER-HIT. The module related to FANCONI ANEMIA 
locates in the cluster colored red and we added such a phenotype FANCONI ANEMIA in the graph. The protein complex RAD50/MRE1 1 A/NBN 
locates in the cluster colored blue. 
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Table 1 Statistical measures for the predicted two 



subtypes of Breast cancer* 



Disease 
subtypes 

(Disease 
subgroup) 


Number of gene pairs 
with high topological 
similarity 
MHT(g, g') <0 R ) 


Number of gene 
pairs with low 
topological 
similarity 
MHT(g, g') >0 R ) 


P value* 


Within 


56 


64 


P,<0.0001 


subgroup 1 








Within 


333 


570 


P 2 <0.0001 


subgroup 2 








Between 


128 


480 





subgroups 
1 and 2 



*: We assess the modularity level of the predicted disease subtypes by 
comparing topological similarity of gene pairs within each subgroup to gene 
pairs between the two subgroups. 

#: The P value of disease subgroup 1 (P^ and the P value of disease subgroup 
2 (P 2 ) are calculated using the Fisher's Exact Test. 

CIPHER-HIT method could serve as an important com- 
plementarity to current approaches for identification of 
cancer subtypes. If the prioritized genes of a queried phe- 
notype are further divided into sub-groups which are 
related to subtypes of the disease, then we call each sub- 
group of genes as the susceptible modules of disease 
subtypes. 

From the above example, it can be seen that the poly- 
morphism of the cancer is related to a group of genes, 
instead of a single gene. We propose to characterize the 
subtypes of a disease by distinguishing the associated 
gene groups. If the adjacent nodes of a given phenotype 
exhibit a genetic or phenotypic difference, namely the 
prioritized genes can be divided into several sub-groups 
according to their relations to the adjacent nodes, it is 
likely to reveal subtypes according to a sub-division. 
Our work demonstrates that the disease subtype analysis 
can be carried out in the network context and benefit 
from the integration of phenotype and gene heteroge- 
neous information. We also show that the modularity- 
based method, CIPHER-HIT, is a promising way to dis- 
cover the subtype-associated genes based on the hetero- 
geneous network structure. Based on the prioritization 
information on the gene sets, the results will allow for 
further clinical and experimental researches. 

For the limitations of the present work, the CIPHER- 
HIT method currently only restricts on the genetic level, 
makes use of relatively simple data resources, and does 
not consider the quantitative analysis for gene expres- 
sions. As one of the future research directions, more 
efforts are still need to be made to evaluate the perfor- 
mance our method on different data, especially include 
quantitative information such as microarray and proteo- 
mics data for discovering disease mechanism in the 
gene expression level or protein level. An extension of 
our method to the systematic identification of disease 
subtypes also needs to be developed. Moreover, we 



believe that the method can also be easily generalized to 
enable the credible prediction of drug targets and detect 
the pleiotropic effects of drugs in our drugCIPHER fra- 
mework [38] if we combine drug targets information 
into the phenotype-gene heterogeneous network. 

Conclusions 

In summary, in this work, we introduce a concept of 
modularity level and propose a CIPHER-HIT method to 
use the Mean-Hitting-Time to measure global closeness 
between nodes of the heterogeneous network that con- 
sists of both genes and phenotypes. This measure has 
solid mathematics foundations and is easy to calculate. 
Based on this measure, we proposed a method to select 
high confident neighbors of a phenotype and detect 
gene modules that are highly connected to these high 
confident neighbors. Therefore the modularity of priori- 
tized genes can be revealed, which may provide more 
mechanistic insights to the phenotype-genotype associa- 
tion. We also demonstrate that the performance of dis- 
ease gene predictions is improved significantly by 
combining the modularity measure into the network 
inference, suggesting modularity is one of key features 
for network-based credible prioritization of candidate 
disease genes. Moreover, by detecting the sub-modules 
in the heterogeneous network, we revealed the poten- 
tially genetic and phenotypic properties of cancer sub- 
types. We believe this method can also be explored to 
predict biomarkers associated with disease subtypes in 
the gene expression and protein levels, as well as detect 
the pleiotropic drug actions in the future. 

Materials and methods 

Dataset and the heterogeneous network 

We used the following three data sets to form the three 
parts, namely the phenotype-phenotype similarity net- 
work, protein-protein interaction network and gene- 
disease association network, of the phenotype-gene 
heterogeneous network based on which the prediction 
was carried out. 

♦ The Human Protein Reference Database (HPRD) 
[39] was adopted to construct the protein-protein inter- 
action network. The largest component of the HPRD 
protein-protein interaction network contains 34364 
edges and 8503 vertices. 

♦ The phenotype similarity came from the results cal- 
culated by van Driel et al. [17]. The phenotype similarity 
network contains 5080 phenotypes. 

♦ The associations between the phenotypes and genes 
were from the OMIM (Online Mendelian Inheritance in 
Man, http://www.ncbi.nlm.nih.gov/omim) records as 
described in precious studies [2,4]. The edge weights of 
this phenotype-gene sub-network will be defined in 
Equation (10). 
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The heterogeneous network was described by the 
weight matrix. We constructed it by merging the weight 
matrices of the sub-networks into one matrix. Let W G 
denote the weight matrix of the HPRD network. For any 
two genes gi and g 2 , if there was a corresponding pro- 
tein-protein interaction recorded in the HPRD database, 
then W G {g lt g 2 ) = 1, otherwise W G fe, g 2 ) = 0. 

The phenotype similarities were used as the descrip- 
tion of the diseases relations. The same data as previous 
works [2,4] were used, where the phenotype similarity 
data were calculated by van Driel et al. [17]. Since the 
high similarities were only present between parts of phe- 
notype pairs, we set a threshold to filter out very low 
similarity values. Let W p denote the weight matrix of 
the phenotype-phenotype similarity network. If the simi- 
larity value between two pheno types p\ and p 2 was lar- 
ger than the threshold 0.4, then the weight W p (pi, p 2 ) 
took this similarity value, otherwise W p {pi, p 2 ) = 0. 

The phenotype-gene associations were taken from the 
same data set as [2,4]. If there was an association 
between phenotype p And gene g, then we specified the 
weight of the corresponding edge as 



W A (S,P) 



(10) 



by which we can achieve that for each pair of asso- 
ciated gene and phenotype (g, p), the average possibility 
of "walking" onto a different sub-network at the point g 
and p in the random walk process will equal 0.5. 

Thus, the weight matrix of the heterogeneous network 
was constructed as 



W = 



WjW G 



(11) 



where WT refers to the transpose of W A . 

We defined the random walk according to the weight 
matrix described as Equation (11) and carried out the 
network inference on it. 



on the heterogeneous network. The math formula expres- 
sions below are mainly adopted from [40,41]. 

The random walk on the heterogeneous network was 
constructed by specifying its transition probability matrix 
P based on the weighted matrix W in Equation (11). 



P(j,j) = _M, where W, = EjW(i,j} 



(12) 



The Mean-Hitting-Time from other nodes to a given 
node p could be obtained by solving the following Equa- 
tion (13) 



(I-P)x(v) = l, v^p 

x(p) = 0, Otherwise, 



(13) 



where / refers to the identity matrix, and x(v) refers to 
the vth component of vector x. 

The non-negative minimum solution 
{x(v) = E v (r p ) : v € V} gave the Mean-Hitting-Time 
from all other nodes, both the gene-nodes and pheno- 
type-nodes, to the given phenotype-node p. Further- 
more, the conditional Mean-Hitting-Time 
Ev(Tp|r p < Tfi) could be computed by solving 



P)y(v) 



0, 



Tb), 



v?BU {p}; 
Otherwise, 



(14) 



where F v (t p <r B ), termed as the harmonic potential in 
the Markov Process theory, is the probability that a ran- 
dom walk starting from v reached p before B. The har- 
monic potential could also be obtained from the 
minimum non-negative solution of 



[I- 



P)z(v) = 0; 

4P) = h 
z(v) = 0, 



vg{p}UB 



v € B 



(15) 



The theoretical proof of Equations (13), (14), and (15) 
is referred to [40,41]. 



The Mean-Hitting-Time and conditional Mean-Hitting- 
Time in CIPHER-HIT 

In the previous random walk with restart method [3,4], the 
stationary distribution is used to define closeness between 
two nodes on a network. Here we define the topological 
properties on the phenotype-gene heterogeneous network 
in the same mathematical background using the Mean-Hit- 
ting-Time of the random walk. This definition is more sui- 
table in solving the problem of both disease-causing gene 
inference and disease subtype inference, because by adopt- 
ing this measure, we no longer have to choose the priori 
parameter required in the former method (which was 
always assumed to be arbitrary), and this measure leads us 
to a natural way of discovering modularity characteristics 
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